eggNOG-mapper initial test analysis (2025-01-10 to 2025-01-14)

Introduction

Aaron emailed me back with advice on a way to make eggNog-mapper/2.1.12 work on Hawk. As I am revising for an exam I can't spend too long on this. P.S. I think I did pretty well on the exam.

Methods

From the 10th to the 12th I tried running a [[test file]] I made earlier in December 2024. I may not have written about it as I didn't get anywhere at the time. This process was also hindered by Hawk being heavily loaded in the new year, meaning it takes a good half day for any results to appear. On the 11th I got a successful result for accession 3Dt1c, including the .xlsx file I need for the heatmaps.

Today (the 12th) I set off 3 jobs, each containing 3 accessions, to hopefully get the results of the remaining 9. If there are no complications I will then be in a position to obtain the .fastas for the online comparison accessions and run them through eggNOG-mapper. It should be noted that I used the same list of parameters I found on the web version of hawk for this, [[screenshot attached]]. I had some marginal success: the second set of three finished in over 3 hours, while the other 2 sets are still going strong after 8, so I'll let them run overnight and see what I have. Maybe they will complete; I did give them 12 hours.

I then discovered the dbmem option (--dbmem), which could help to speed up the process. I tested it with [[this file]], set to run overnight from roughly 9:30 pm on the 12th to 3:30 am on the 13th, totalling 6 hours for 5 files. Not great. I then experimented with taking out some of the arguments (--evalue 0.001 --score 60 --pident 40 --query_cover 20 --subject_cover 20), which produced [[this script]]. Tomorrow (the 14th) I will have a look at recreating run 1 with dbmem switched on.
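
For reference, the two variants described above (run 2 with dbmem and the filter arguments, run 3 with the filter arguments removed) can be sketched as a small command builder. This is only a sketch under assumptions: the input file name, output prefix, output directory and CPU count are hypothetical placeholders, and it assumes the .xlsx output comes from emapper.py's --excel flag. The real job scripts are the ones linked above and may differ.

```python
import shlex

# Filter thresholds quoted in the notes above.
FILTER_ARGS = {
    "--evalue": "0.001",
    "--score": "60",
    "--pident": "40",
    "--query_cover": "20",
    "--subject_cover": "20",
}


def build_emapper_command(fasta, prefix, out_dir, cpus=8, dbmem=True, use_filters=True):
    """Return the emapper.py argument list for one accession.

    fasta, prefix, out_dir and cpus are illustrative placeholders,
    not the values used in the real job scripts.
    """
    cmd = [
        "emapper.py",
        "-i", fasta,
        "-o", prefix,
        "--output_dir", out_dir,
        "--cpu", str(cpus),
        "--excel",  # assumed source of the .xlsx file needed for the heatmaps
    ]
    if dbmem:
        cmd.append("--dbmem")  # load the annotation database into memory
    if use_filters:
        for flag, value in FILTER_ARGS.items():
            cmd += [flag, value]
    return cmd


# Run 2 style: dbmem on, filter arguments kept.
run2 = build_emapper_command("3Dt1c.faa", "3Dt1c", "emapper_out")
# Run 3 style: dbmem on, filter arguments removed.
run3 = build_emapper_command("3Dt1c.faa", "3Dt1c", "emapper_out", use_filters=False)

print(shlex.join(run2))
print(shlex.join(run3))
```

The builder only prints the command lines, so they can be pasted into a job script or handed to subprocess.run once the paths are real.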

Results

Run 1 had mixed results. 3 sets of 3 ran in parallel: set 1 completed in 3 and a half hours (good), set 2 took 8 and a half hours (bad), and set 3 timed out (very bad), so I don't have the outputs needed for 1Dt100h or 1Dt1h. Run 2 was more successful, with 5 solid-looking outputs in 6 hours. Run 3 did not improve on that time despite the extra parameters being cut.

Conclusion

It could have been overcrowding on Hawk, but it appears that running multiple sets in parallel adversely affects run time. I have not yet tested this with dbmem on. The larger problem is the volume of samples required. The desired output is 3 heatmaps comparing KO pathways:
- Comparing genera inside Sphingomonadaceae
- Comparing genera inside Microbacteriaceae
- Comparing the genera containing just our samples

Additional pre-eggnog analysis + API (2025-01-17 to 2025-01-__)

Introduction

Hawk is still running slow, so while the jobs from the previous section are running I figured I would do something extra to fill the time. I wanted to see just how many samples we are going to need to pull down and process, and get a rough time estimate for that. I also wanted to look at the API calls for pulling the files down, as that is by far the easy bit.

Methods

I started by making tables of the sets of samples I am going to need from the NCBI website. I identified 3 groups of samples:
1. all the genera containing our flye_asm samples
2. all the genera in the family Sphingomonadaceae
3. all the genera in the family Microbacteriaceae
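
The tallying behind the tables in the Results could be sketched roughly as follows. This assumes the taxonomy for each accession is already available as a GTDB-style lineage string (how it was extracted from the .tree file is not recorded here). The accession names and shortened lineages are made up for illustration.

```python
from collections import Counter


def genus_counts(taxonomy_by_accession, family):
    """Count accessions per genus among the lineages that belong to `family`.

    taxonomy_by_accession maps an accession name to a GTDB-style lineage
    such as "f__Sphingomonadaceae;g__Sphingomonas" (shortened here).
    """
    counts = Counter()
    for lineage in taxonomy_by_accession.values():
        ranks = [rank.strip() for rank in lineage.split(";")]
        if family in ranks:
            # An unassigned genus shows up as a bare "g__".
            genus = next((r for r in ranks if r.startswith("g__")), "g__")
            counts[genus] += 1
    return counts


# Hypothetical accessions and shortened lineages, for illustration only.
example = {
    "acc_1": "f__Sphingomonadaceae;g__Sphingomonas",
    "acc_2": "f__Sphingomonadaceae;g__Sphingomonas",
    "acc_3": "f__Sphingomonadaceae;g__Novosphingobium",
    "acc_4": "f__Microbacteriaceae;g__Microbacterium",
}

for genus, n in sorted(genus_counts(example, "f__Sphingomonadaceae").items()):
    print(genus, n)
```

Summing the values of the returned Counter gives the family total, which is a quick cross-check against the Total row of each table.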

This fits with the specification of work I was given: 2 analyses, 1 comparing just our genera and another comparing genera in the families we have multiple samples in. As of now I have yet to do the API call.

Results

All tables

Family Sphingomonadaceae

Table 10 - number of accessions per genus in the family Sphingomonadaceae
Accessions as found in the .tree file output by the gtdbtk analysis done on 2024-12-24
Family and Genus Number of Accessions
f__Sphingomonadaceae
g__34-65-8 1
g__Actirhodobacter 1
g__Alg239-R122 1
g__Allosphingosinicella 20
g__Alteraurantiacibacter 20
g__Altererythrobacter_D 2
g__Altererythrobacter_F 1
g__Altericroceibacterium 3
g__Altericroceibacterium_A 1
g__Alteripontixanthobacter 1
g__Alteriqipengyuania 9
g__Alteriqipengyuania_A 2
g__Blastomonas 8
g__CADCVW01 1
g__CAHJWT01 4
g__CFH-75059 1
g__Caenibius 5
g__Chakrabartia 9
g__Croceibacterium 15
g__Croceicoccus 10
g__Erythrobacter 66
g__GCA-014117445 1
g__Glacieibacterium 2
g__Hankyongella 1
g__JACXVD01 1
g__Novosphingobium 115
g__Novosphingopyxis 2
g__Pacificimonas 4
g__Parapontixanthobacter 1
g__Parasphingopyxis 7
g__Parasphingorhabdus 18
g__Paraurantiacibacter 1
g__Parerythrobacter 2
g__Pelagerythrobacter 5
g__Polymorphobacter 9
g__Polymorphobacter_A 1
g__Pontixanthobacter 6
g__Pseudopontixanthobacter 2
g__Pseudopontixanthobacter_A 2
g__QFOP01 1
g__Qipengyuania 29
g__Rhizorhabdus 14
g__Rhizorhapis 2
g__SCN-67-18 1
g__Sandaracinobacter 4
g__Sandarakinorhabdus 7
g__Sphingobium 77
g__Sphingobium_A 2
g__Sphingomicrobium 38
g__Sphingomonas 205
g__Sphingomonas_B 6
g__Sphingomonas_D 1
g__Sphingomonas_E 3
g__Sphingomonas_G 5
g__Sphingomonas_H 1
g__Sphingomonas_I 5
g__Sphingomonas_K 1
g__Sphingomonas_L 2
g__Sphingomonas_M 1
g__Sphingomonas_N 6
g__Sphingopyxis 62
g__Sphingorhabdus_B 25
g__Sphingorhabdus_C 2
g__Sphingosinicella 4
g__Tardibacter 1
g__Thermaurantiacus 1
g__Tsuneonella 10
g__UBA1936 3
g__UBA6174 2
g__XMGL2 1
g__ZODW24 1
g__Zymomonas 3
Total 887

Family Microbacteriaceae

Table 11 - number of accessions per genus in the family Microbacteriaceae
Accessions as found in the .tree file output by the gtdbtk analysis done on 2024-12-24
Family and Genus Number of Accessions
f__Microbacteriaceae
g__73-13 2
g__Agreia 6
g__Agrococcus 16
g__Agromyces 41
g__Agromyces_B 1
g__Alpinimonas 1
g__Amnibacterium 2
g__Aquiluna 15
g__Aurantimicrobium 3
g__CAIOLM01 1
g__Canibacter 4
g__Chryseoglobus 8
g__Clavibacter 17
g__Cnuibacter 1
g__Compostimonas 1
g__Conyzicola 3
g__Cryobacterium 43
g__Cryobacterium_C 1
g__Curtobacterium 51
g__Cx-87 1
g__Diaminobutyricibacter 1
g__Diaminobutyricimonas 2
g__Frigoribacterium 15
g__Frondihabitans 5
g__Galbitalea 2
g__Glaciibacter 1
g__Glaciihabitans 2
g__Gryllotalpicola 3
g__Gulosibacter 9
g__Herbiconiux 7
g__Homoserinimonas 4
g__Humibacter 4
g__JAAFHU01 1
g__JAFIQW01 1
g__Klugiella 1
g__Labedella 4
g__Lacisediminihabitans 5
g__Leifsonia 19
g__Leifsonia_A 4
g__Leifsonia_B 1
g__Leucobacter 43
g__Lumbricidophila 1
g__Lysinibacter 1
g__MWH-TA3 7
g__Marinisubtilis 3
g__Marisediminicola 4
g__Microbacterium 254
g__Microbacterium_A 4
g__Microcella 3
g__Microterricola 7
g__Mycetocola 3
g__Mycetocola_A 5
g__Mycetocola_B 1
g__NC76-1 1
g__Naasia 4
g__OACT-916 1
g__Okibacterium 2
g__Planctomonas 2
g__Plantibacter 6
g__Pontimonas 10
g__Protaetiibacter 9
g__Pseudoclavibacter 9
g__Pseudoclavibacter_A 3
g__Pseudolysinimonas 5
g__RFQD01 2
g__Rathayibacter 22
g__Rhodoglobus 15
g__Rhodoluna 35
g__Root112D2 1
g__SCRE01 1
g__Schumannella 4
g__Subtercola 9
g__Terrimesophilobacter 3
g__Tropheryma 1
g__UBA3913 2
g__UBA963 5
g__WSTA01 2
g__Yonghaparkia 5
g__ZJ450 2
Total 806

Our Genera

Table 12 - number of accessions per genus for the genera containing accessions produced at Bangor
Accessions as found in the .tree file output by the gtdbtk analysis done on 2024-12-24
Genus Number of Accessions
g__ 1
g__Brachybacterium 32
g__Brevibacterium 43
g__Microbacterium 254
g__Pantoea 52
g__Sphingomonas 205
Total 587

Conclusion

TODO:
- Go and compare the content of runs 2 and 3.
- eggNOG-mapper on Hawk is slow, possibly too slow to scale to where we need it to be. Alternatives:
  - What scale are we looking at? 1693 samples for the family comparison.
  - Could limit to genera with more than 30 samples.
  - Screen-scrape method.
  - Enlist more manpower to do it online (if we have to do it on the web).
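
To put a rough number on the scale problem, here is a back-of-the-envelope sketch. It assumes run time scales linearly from run 2 (5 files in 6 hours) and that samples are processed one after another. Both are simplifications, since Hawk load varied a lot and jobs can be split across nodes. Under those assumptions the 1693 family-comparison samples come out at roughly 2,000 hours, or about 85 days. The same sketch shows the "more than 30 samples" filter applied to a few rows copied from Table 10.

```python
# Family totals from Tables 10 and 11.
FAMILY_TOTALS = {"f__Sphingomonadaceae": 887, "f__Microbacteriaceae": 806}

# Run 2 throughput from the previous section: 5 files in 6 hours.
RUN2_FILES = 5
RUN2_HOURS = 6.0


def serial_hours(n_samples, files_done=RUN2_FILES, hours_taken=RUN2_HOURS):
    """Linear extrapolation: hours to process n_samples one after another."""
    return n_samples * hours_taken / files_done


def filter_genera(counts, min_samples=30):
    """Keep only the genera with more than min_samples accessions."""
    return {genus: n for genus, n in counts.items() if n > min_samples}


# A few rows copied from Table 10, for illustration only.
table10_subset = {
    "g__Blastomonas": 8,
    "g__Erythrobacter": 66,
    "g__Qipengyuania": 29,
    "g__Sphingomicrobium": 38,
}

total_samples = sum(FAMILY_TOTALS.values())
hours = serial_hours(total_samples)
print(f"{total_samples} samples: about {hours:.0f} h ({hours / 24:.0f} days) if run serially")
print(filter_genera(table10_subset))
```

Feeding the full tables through filter_genera and then through serial_hours would show how much time the 30-sample cut-off saves.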